
[0002] BACKGROUND 
1. Field 

[0003] The present disclosure pertains to the field of memory and computer memory 
systems and more specifically to error detection and correction for memory errors. 

[0005] 2. Description of Related Art 

Error correcting codes (ECC) have been routinely used for fault tolerance 
in computer memory subsystems. The most commonly used codes are the single 
error correcting (SEC) and double error detecting (DED) codes capable of 
correcting all single errors and detecting all double errors in a code 
word. 

As chip manufacturing trends toward larger chip capacity, more and more 
memory subsystems will be configured with b bits per chip. The most appropriate 
symbol ECC for such memory is the class of single symbol error correcting (SbEC) and 
double symbol error detecting (DbED) codes, wherein "b" is the width (number of output 
bits) of the memory device. These codes correct all single symbol errors and detect all 
double symbol errors in a code word. A memory designed with an SbEC-DbED code can 
continue to function when a memory chip fails, regardless of its failure mode. When 
two failing chips later line up in the same ECC word, the SbEC-DbED code provides 
the necessary error detection and protects the data integrity of the memory. 

Existing and imminent memory systems utilize eighteen memory devices. 
However, present SbEC-DbED error correcting codes utilize 36 memory devices in 
order to provide chipfail correction and detection. Thus, cost increases due to the 
added expense of 36 memory devices for error correcting purposes, and these codes are 
inflexible because they do not scale (adapt) to memory systems with eighteen memory 
devices. Furthermore, the various circuits for encoding and decoding the errors are 
complex. This increases the cost and design complexity of computer systems that must 
ensure data integrity. 
Brief Description of the Figures 

[0006] The present invention is illustrated by way of example and not limitation in 
the Figures of the accompanying drawings. 

[0007] Figure 1 illustrates a block diagram of a code word utilized in an embodiment. 
[0008] Figure 2 illustrates an apparatus utilized in an embodiment. 

Figure 3 illustrates a flowchart of a method utilized in an embodiment. 

Figure 4 illustrates an apparatus utilized in an embodiment described in 
connection with Figure 2. 

Figure 5 illustrates a system utilized in an embodiment. 
Detailed Description 

[0009] The following description provides a method, apparatus, and system for error 
detection and correction of memory devices. In the following description, numerous 
specific details are set forth in order to provide a more thorough understanding of the 
present invention. It will be appreciated, however, by one skilled in the art that the 
invention may be practiced without such specific details. Those of ordinary skill in the 
art, with the included descriptions, will be able to implement appropriate logic circuits 
without undue experimentation. 

As previously described, a typical ECC code utilizes 36 memory devices for chipfail 
detection and correction, which increases the cost and design complexity of a computer 
system. Also, with the advent of systems with eighteen memory devices, the present ECC 
codes do not scale. In contrast, the claimed subject matter facilitates a new ECC code, 
the "adjacent-symbol" code, that supports memory systems with 18 memory devices. For 
example, in one embodiment, the claimed subject matter facilitates decoding and 
correcting memory errors in systems that utilize 18 memory devices for a memory 
transaction (memory rank). Furthermore, the claimed subject matter facilitates forming a 
code word of data with only two clock phases. Also, the adjacent-symbol ECC code 
corrects any error pattern within the data from one memory device and detects various 
errors (double device errors) from failures in 2 memory devices. 
In one embodiment, the adjacent-symbol ECC code is utilized for a memory 
system with two channels of Double Data Rate (DDR) memory, wherein each channel is 
64 bits wide with eight optional bits for ECC. Also, the memory system may utilize x4 or 
x8 wide memory devices (x4 and x8 refer to the number of bits that can be output from 
the memory device). Thus, the claimed subject matter supports various configurations of 
memory systems. For example, a memory system with x8 devices would utilize 18 
memory devices per memory rank if ECC is supported and 16 memory devices per 
memory rank if ECC is not supported. Alternatively, a memory system with x4 devices 
would utilize 36 memory devices per memory rank if ECC is supported and 32 
memory devices per memory rank if ECC is not supported. 
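
The device counts quoted above follow directly from the bus geometry. The short sketch below reproduces that arithmetic; the function and constant names are illustrative and do not appear in the specification.

```python
# Bus geometry taken from the text: two DDR channels, each 64 data bits wide
# with 8 optional ECC bits. The names below are illustrative only.
CHANNELS = 2
DATA_BITS_PER_CHANNEL = 64
ECC_BITS_PER_CHANNEL = 8


def devices_per_rank(device_width: int, ecc: bool) -> int:
    """Return how many devices of the given width fill one rank."""
    bits_per_channel = DATA_BITS_PER_CHANNEL + (ECC_BITS_PER_CHANNEL if ecc else 0)
    total_bits = CHANNELS * bits_per_channel
    if total_bits % device_width:
        raise ValueError("device width does not divide the bus width")
    return total_bits // device_width


if __name__ == "__main__":
    for width in (8, 4):
        print(f"x{width}: with ECC = {devices_per_rank(width, True)}, "
              f"without ECC = {devices_per_rank(width, False)}")
```

With ECC enabled the two channels present 144 bits per clock edge, which is 18 x8 devices or 36 x4 devices.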

[0011] Figure 1 illustrates a block diagram of a code word utilized in an embodiment. 
The block diagram 100 comprises an adjacent symbol codeword 106 to be formed from 
two clock phases of data 102 and 104 from a memory device. For example, in one 
embodiment, a memory access transaction comprises a transfer of 128 data bits plus an 
additional 16 ECC check bits per clock edge, for a total of 144 bits for each clock edge 
(288 bits for both clock edges). In a first clock phase 102, a first nibble "n0" and a 
second nibble "n2" of data from a memory are transferred and mapped to a first nibble of 
each of two symbols of the codeword 106. Subsequently, during a second clock phase 
104, a first nibble "n1" and a second nibble "n3" from a memory are transferred and 
mapped to a second nibble of each of two symbols of the codeword 106. The two 
symbols of the codeword 106 are therefore adjacent and aligned on a 16-bit boundary of 
the code word. They are designated "adjacent symbols", and the codeword 106 is 
accordingly an adjacent symbol codeword. 
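
The nibble-to-symbol mapping of Figure 1 can be sketched for a rank of 18 x8 devices as follows. The bit layout is an assumption made for illustration: the low nibble of each 8-bit transfer is taken as the "first" nibble, and devices are laid out in the codeword in index order. The specification does not fix either choice, and the function names are hypothetical.

```python
DEVICES_PER_RANK = 18  # x8 devices, 144 bits per clock phase


def adjacent_symbols(phase0: int, phase1: int) -> tuple[int, int]:
    """Map one x8 device's two 8-bit transfers to two adjacent 8-bit symbols.

    Assumed layout: low nibble = "first" nibble, high nibble = "second" nibble.
    """
    n0, n2 = phase0 & 0xF, (phase0 >> 4) & 0xF  # first clock phase (102)
    n1, n3 = phase1 & 0xF, (phase1 >> 4) & 0xF  # second clock phase (104)
    symbol_a = n0 | (n1 << 4)  # n0 fills the first nibble, n1 the second
    symbol_b = n2 | (n3 << 4)  # n2 fills the first nibble, n3 the second
    return symbol_a, symbol_b


def build_codeword(phase0: bytes, phase1: bytes) -> bytes:
    """Assemble a 288-bit codeword (36 symbols) from two 144-bit transfers."""
    if len(phase0) != DEVICES_PER_RANK or len(phase1) != DEVICES_PER_RANK:
        raise ValueError("expected one byte per device per clock phase")
    codeword = bytearray()
    for p0, p1 in zip(phase0, phase1):
        codeword.extend(adjacent_symbols(p0, p1))
    return bytes(codeword)
```

Under this layout, everything a single x8 device contributes across both clock phases lands in one 16-bit-aligned pair of symbols, which is the property the adjacent-symbol code relies on.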

[0012] The scheme illustrated in the block diagram facilitates error detection and 
improves fault coverage of common mode errors. For example, for an x4 memory 
device, there is a one to one mapping of nibbles from the x4 memory device to a symbol 
in the underlying code word. In contrast, for an x8 memory device, there is a one to one 
mapping of nibbles from half of the x8 memory device to a symbol in the underlying code 
word. Thus, the claimed subject matter facilitates isolating common mode errors across 
nibbles to the symbol level and results in increased fault coverage. Therefore, for the x8 
memory device, the claimed subject matter precludes aliasing for a second device failure. 
Likewise, device errors in the x4 memory devices are isolated to a single symbol in the 
codeword 106; thus, there is complete double device coverage for the x4 memory devices. 
[0013] To further illustrate, there are typically two classes of double device failures, 
simultaneous and sequential, that occur in the same memory rank. 

[0014] A simultaneous double device failure has no early warning sign because there 
is no indication of an error in a previous memory transaction. Typically, the computer 
system reports an uncorrectable error in the absence of aliasing. However, the system 
might incorrectly report a correctable single device failure. In that case, the aliasing may 
be discovered in subsequent accesses because the error pattern might change so as to 
preclude the alias. 

[0015] In contrast, a sequential double device failure is a more typical failure pattern 
than a simultaneous double device failure. Typically, the first device error is detected as a 
correctable error. For a second device failure, there may be two outcomes in one 
embodiment: either the error is reported as uncorrectable, or the error is reported as a 
correctable error at a new location. In the event of an uncorrectable error for the second 
device failure, the analysis is complete. Otherwise, the system changes the error location 
from the first device failure to the second device's failure location. Therefore, the 
preceding method for detecting the alias is accurate because it is unlikely that the first 
device failure location resolves itself and even less likely that it does so at the same 
instant that the second device fails. 
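
The sequential-failure reasoning above can be sketched as a small bookkeeping routine: remember the location of the first correctable device error in a rank and treat a later correctable error at a different location as a suspected alias of a double device failure. This is a minimal illustration, not the disclosed circuit; the class name, method name, and returned strings are hypothetical.

```python
from typing import Optional


class RankErrorTracker:
    """Tracks the last correctable error location seen in one memory rank."""

    def __init__(self) -> None:
        self.last_location: Optional[int] = None

    def report(self, correctable: bool, location: Optional[int] = None) -> str:
        """Classify a newly reported error against the rank's history."""
        if not correctable:
            # Second outcome in the text: the analysis is complete.
            return "uncorrectable"
        if self.last_location is None:
            self.last_location = location
            return "first device failure"
        if location == self.last_location:
            return "known device failure"
        # The location moved. It is unlikely that the first device healed at
        # the same instant a second one failed, so suspect an alias.
        self.last_location = location
        return "suspected alias of a double device failure"
```

A short usage example: after `tracker.report(True, 3)`, a later `tracker.report(True, 7)` returns the suspected-alias result, while a repeated `tracker.report(True, 3)` would have been treated as the already known failure.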

[0016] A few examples of double device errors that are always detected (no aliasing) 
are double bit errors, double wire faults, wire faults in one memory device with a single 
bit error in a second memory device, and a fault that affects only one nibble of each 
memory device. 

[0017] In one example of a device error for the x8 memory device, all 16 bits of the 
codeword (adjacent symbols) may be affected (corrupted) because the failure results in an 
error for both nibbles and both clock phases of the memory device's data. Thus, the 
claimed subject matter facilitates the correction of this device failure by first correcting 
the 16 bits that are in error. However, in the event of a second memory device failure, the 
code detects the error pattern in two groups of 16 bits which are aligned on 16-bit 
boundaries in the code word 106. 

[0018] Figure 2 illustrates an apparatus utilized in an embodiment. From a high-level 
perspective, the apparatus generates a code word by creating check bits to be appended to 
data that is forwarded to memory. Subsequently, the apparatus generates a syndrome 
based at least in part on decoding the code word received from memory and facilitates 
classifying errors and correcting the errors. In one embodiment, the code word from the 
memory device is an adjacent symbol codeword that was described in connection with 
Figure 1. 

[0019] The apparatus comprises an encoder circuit 202, at least one memory device 
204, a decoder circuit 206, an error classification circuit 208, and a correction circuit 210. 
[0020] The encoder circuit receives data that is to be forwarded to the memory device 
or memory devices 204. The encoder circuit generates a plurality of check bits based at 
least in part on the data. Thus, a codeword is formed based at least in part on the plurality 
of check bits and the data and is forwarded to the memory device or memory devices 
204. 

In one embodiment, the check bits are generated from the binary form of a G-matrix, 
wherein the matrix has 32 rows and 256 columns to form 32 check bits. The check bits 
are computed as follows: 

$c_i = \sum_{j=0}^{255} d_j \times G_{ij}$ for i = 0 to 31 

For binary data, the multiply operation becomes an AND function and the sum 
operation becomes a 1-bit sum, or XOR, operation. Thus, the resulting encoding circuit 
comprises 32 XOR trees, each tree computing one of the 32 check bits. 
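
The check-bit computation can be modeled in software as 32 parity (XOR-tree) evaluations, one per row of the G-matrix. The specification does not list the G-matrix entries, so the sketch below substitutes a randomly generated placeholder matrix; it illustrates the AND/XOR structure only and is not the adjacent-symbol code itself. All names are illustrative.

```python
import random

DATA_BITS = 256   # columns of the G-matrix
CHECK_BITS = 32   # rows of the G-matrix


def make_placeholder_g(seed: int = 0) -> list[int]:
    """Return 32 rows of 256 bits each. Placeholder values, NOT the real G."""
    rng = random.Random(seed)
    return [rng.getrandbits(DATA_BITS) for _ in range(CHECK_BITS)]


def parity(x: int) -> int:
    """1-bit sum (XOR) of all bits of x."""
    return bin(x).count("1") & 1


def check_bits(data: int, g_rows: list[int]) -> int:
    """Compute c_i = XOR over j of (d_j AND G_ij) for every row i."""
    result = 0
    for i, row in enumerate(g_rows):
        result |= parity(data & row) << i  # one XOR tree per check bit
    return result
```

Because each check bit is a parity over a fixed subset of data bits, the encoder is linear over GF(2): the check bits of `a ^ b` equal the XOR of the check bits of `a` and `b`.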

Subsequently, the memory device or memory devices 204 return the data and the check 
bits to the decoder circuit 206. In one embodiment, the decoder circuit generates a 
32-bit syndrome based at least in part on a 288-bit code word (as earlier described in 
connection with Figure 1 for the 288-bit code word). 

In one embodiment, the syndrome is generated from an H-matrix, wherein the 
matrix comprises 32 rows and 288 columns. Each syndrome bit is calculated as follows: 

$s_i = \sum_{j=0}^{287} v_j \times H_{ij}$ for i = 0 to 31 

As previously described with the encoder circuit, the generation of the syndrome bits 
is simplified to an XOR operation over the code word bits corresponding to the columns of 
the H-matrix that have a binary 1 value. Thus, the decoding circuit comprises 32 XOR 
trees, each tree computing one of the 32 syndrome bits. Therefore, in one embodiment, a 
32-bit syndrome is generated by applying the H-matrix to a 288-bit codeword. However, 
the claimed subject matter is not limited to this bit configuration. One skilled in the art 
appreciates modifications to the size of the syndrome and codeword. 

Attorney Docket 042390 . P15881 
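
The syndrome computation can be sketched in the same style. The specification does not give the H-matrix, so this example assumes the common systematic form H = [G | I], with a random placeholder G and the 32 check bits stored above the 256 data bits. Under that assumption a valid codeword yields a zero syndrome, and a single flipped bit yields the matching column of H. All names and the bit layout are illustrative.

```python
import random

DATA_BITS, CHECK_BITS = 256, 32
CODE_BITS = DATA_BITS + CHECK_BITS  # 288

# Placeholder matrices (assumption): random G, and H = [G | I] in row form.
_rng = random.Random(0)
G_ROWS = [_rng.getrandbits(DATA_BITS) for _ in range(CHECK_BITS)]
H_ROWS = [g | (1 << (DATA_BITS + i)) for i, g in enumerate(G_ROWS)]


def parity(x: int) -> int:
    """1-bit sum (XOR) of all bits of x."""
    return bin(x).count("1") & 1


def encode(data: int) -> int:
    """Append 32 check bits above the 256 data bits."""
    check = sum(parity(data & g) << i for i, g in enumerate(G_ROWS))
    return data | (check << DATA_BITS)


def syndrome(codeword: int) -> int:
    """Compute s_i = XOR over j of (v_j AND H_ij) for every row i."""
    return sum(parity(codeword & h) << i for i, h in enumerate(H_ROWS))


def h_column(j: int) -> int:
    """Return column j of H packed as a 32-bit integer."""
    return sum(((h >> j) & 1) << i for i, h in enumerate(H_ROWS))
```

For example, `syndrome(encode(d))` is zero for any 256-bit `d`, and `syndrome(encode(d) ^ (1 << j))` equals `h_column(j)`, which is what allows a decoder to relate a nonzero syndrome to an error position.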

The error classification and error correction are described in connection with Figure 4. 

Figure 3 depicts a flowchart for a method utilized in an embodiment. The flowchart 
depicts a method for detecting whether there were errors in data in a transaction with a 
memory device or devices. A first block 302 generates check bits to be appended to data 
for forwarding to a memory device or devices. An adjacent symbol codeword is 
generated based at least in part on data received from the memory device or devices to be 
utilized for checking the integrity of the data, as depicted by a block 304. A decoder 
generates a syndrome based at least in part on the adjacent symbol codeword, as depicted 
by a block 306. In the presence of an error as determined by the syndrome, an error 
classification and correction is performed, as depicted by a block 308. 

Figure 4 illustrates an apparatus utilized in an embodiment described in 
connection with Figure 2. As previously described, Figure 4 describes one embodiment 
of the error classification and error correction in connection with Figure 2. 

The error classification is based at least in part on the decoding circuit's 
computation of the syndrome. For example, in one embodiment, if the syndrome (S) 
== 0, then there is no error. Otherwise, if the syndrome (S) > 0, there is an error. It is 
also optional to further classify the error by computing an error location vector L. For 
example, in one embodiment, the error is uncorrectable if L == 0. Otherwise, the error is 
correctable in an indicated column if L > 0. Furthermore, one may further classify the 
correctable error according to whether it occurs in a data column or a check column. For 
example, if the error is in a check column, the data portion of the code word may bypass 
the correction logic. 
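
The classification rules above amount to a short decision chain. The sketch below takes the syndrome S and the error location vector L as already computed inputs, since the specification does not describe how L is derived. It also assumes, for illustration only, that L carries one bit per adjacent symbol pair, with the 16 data pairs in the low bits and the 2 check pairs above them.

```python
from enum import Enum

NUM_DATA_PAIRS = 16   # assumption: 256 data bits / 16 bits per adjacent pair
NUM_CHECK_PAIRS = 2   # assumption: 32 check bits / 16 bits per adjacent pair


class ErrorClass(Enum):
    NO_ERROR = "no error"
    UNCORRECTABLE = "uncorrectable"
    CORRECTABLE_DATA = "correctable, data column"
    CORRECTABLE_CHECK = "correctable, check column"


def classify(syndrome: int, locator: int) -> ErrorClass:
    """Apply the S == 0, L == 0, L > 0 rules described in the text."""
    if syndrome == 0:
        return ErrorClass.NO_ERROR
    if locator == 0:
        return ErrorClass.UNCORRECTABLE
    check_mask = ((1 << NUM_CHECK_PAIRS) - 1) << NUM_DATA_PAIRS
    if locator & check_mask:
        # The data portion may bypass the correction logic in this case.
        return ErrorClass.CORRECTABLE_CHECK
    return ErrorClass.CORRECTABLE_DATA
```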

In yet another embodiment, a single device correctable error may be classified 
based at least in part on a weight of the error value. As depicted in Figure 4, an adjacent 
pair may generate error values e0 and e1. The error locator vector L is then used to 
gate the error values onto a plurality of busses, 402 and 404, because the circuits allow 
the error locator bits for only one adjacent pair to be enabled for a given error pattern. 
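
One way to picture the gating step in software is to XOR the two error values into the pair of symbols selected by the locator. This is a sketch under stated assumptions, not the circuit of Figure 4: the codeword is modeled as 36 one-byte symbols, L is assumed to have one bit per adjacent pair, and the function name is hypothetical.

```python
NUM_PAIRS = 18          # assumption: one locator bit per adjacent symbol pair
SYMBOLS_PER_WORD = 36   # 288-bit codeword as 8-bit symbols


def apply_correction(codeword: bytes, locator: int, e0: int, e1: int) -> bytes:
    """XOR error values e0 and e1 into the adjacent pair selected by locator."""
    if len(codeword) != SYMBOLS_PER_WORD:
        raise ValueError("expected a 36-symbol codeword")
    corrected = bytearray(codeword)
    for pair in range(NUM_PAIRS):
        if (locator >> pair) & 1:         # locator bit gates the error values
            corrected[2 * pair] ^= e0      # first symbol of the adjacent pair
            corrected[2 * pair + 1] ^= e1  # second symbol of the adjacent pair
    return bytes(corrected)
```

With a locator of zero the codeword passes through unchanged, which matches the case where no correctable location is indicated.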

Thus, the claimed subject matter allows for test coverage of both single and 
double device errors. 


[0022] Figure 5 depicts a system in accordance with one embodiment. The system in 
one embodiment comprises a processor 502 that is coupled to a chipset 504 that is 
coupled to a memory 506. For example, the chipset performs and facilitates various 
operations, such as memory transactions between the processor and memory, and verifies 
data integrity by utilizing the adjacent symbol codeword as described in connection with 
Figure 1. In one embodiment, the chipset is a server chipset to support a computer server 
system. In contrast, in another embodiment, the chipset is a desktop chipset to support a 
computer desktop system. In both of these embodiments, the system comprises the 
embodiments depicted in Figures 1-4 of the specification to support the adjacent symbol 
codeword and the error correction and detection methods and apparatus. 
[0023] While certain exemplary embodiments have been described and shown in the 
accompanying drawings, it is to be understood that such embodiments are merely 
illustrative of and not restrictive on the broad invention, and that this invention not be 
limited to the specific constructions and arrangements shown and described, since various 
other modifications may occur to those ordinarily skilled in the art upon studying this 
disclosure. 